JMaxAlign: A Maximum Entropy Parallel Sentence Alignment Tool
Abstract
Parallel corpora are an extremely useful tool in many natural language processing tasks, particularly statistical machine translation. Parallel corpora for certain language pairs, such as Spanish or French, are widely available, but for many language pairs, such as Bengali and Chinese, it is impossible to find parallel corpora. Several tools have been developed to automatically extract parallel data from non-parallel corpora, but they use language-specific techniques or require large amounts of training data. This paper demonstrates that maximum entropy classifiers can be used to detect parallel sentences for any language pair with small amounts of training data. This paper is accompanied by JMaxAlign, a Java maxent classifier that detects parallel sentences.
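To make the idea of a maximum entropy classifier for parallel sentence detection concrete, here is a minimal sketch in Python. It is not JMaxAlign's implementation, and the abstract does not say which features JMaxAlign uses: the two features (length ratio and lexicon coverage), the toy Spanish-English lexicon, the training pairs, and all function names are assumptions made for illustration. For two classes, a maxent model reduces to logistic regression, which is what the sketch trains.

```python
import math


def features(src, tgt, lexicon):
    """Hypothetical features for a sentence pair (not JMaxAlign's feature set)."""
    s, t = src.lower().split(), tgt.lower().split()
    # Ratio of the shorter to the longer sentence length, in [0, 1].
    length_ratio = min(len(s), len(t)) / max(len(s), len(t), 1)
    # Fraction of source words whose lexicon translation appears in the target.
    t_set = set(t)
    covered = sum(1 for w in s if lexicon.get(w, set()) & t_set)
    coverage = covered / max(len(s), 1)
    return [1.0, length_ratio, coverage]  # the leading 1.0 is the bias feature


def prob_parallel(weights, x):
    """P(parallel | x) under a two-class maxent (logistic) model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=500, lr=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            p = prob_parallel(weights, x)
            # Gradient of the log-likelihood for one example is (y - p) * x.
            weights = [w + lr * (y - p) * xi for w, xi in zip(weights, x)]
    return weights


# Toy lexicon and training pairs, invented for this example.
lexicon = {
    "el": {"the"}, "gato": {"cat"}, "perro": {"dog"},
    "duerme": {"sleeps"}, "come": {"eats"},
}
training_pairs = [
    ("el gato duerme", "the cat sleeps", 1),
    ("el perro come", "the dog eats", 1),
    ("el gato duerme", "stock prices fell sharply on monday", 0),
    ("el perro come", "rain is expected", 0),
]
examples = [(features(s, t, lexicon), y) for s, t, y in training_pairs]
weights = train(examples)

p_parallel = prob_parallel(weights, features("el gato come", "the cat eats", lexicon))
p_unrelated = prob_parallel(
    weights, features("el perro duerme", "markets closed early today", lexicon)
)
print(f"parallel pair:  {p_parallel:.3f}")
print(f"unrelated pair: {p_unrelated:.3f}")
```

A pair would be accepted as parallel when its probability exceeds a chosen threshold such as 0.5. Because the features depend only on sentence lengths and a bilingual lexicon, nothing in the sketch is tied to a particular language pair.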
Similar resources
Sentence Alignment Method Based on Maximum Entropy Model Using Anchor Sentences
The paper proposes a sentence alignment method based on a maximum entropy model using anchor sentences to align ancient and modern Chinese sentences in historical classics. The method selects the sentence pairs with the same phrases at the beginning or the end of the sentence or with the same time phrases as anchor sentence pairs, which are employed to divide the paragraph into several sections. ...
Comparison of Alignment Templates and Maximum Entropy Models for Natural Language Understanding
In this paper we compare two approaches to natural language understanding (NLU). The first approach is derived from the field of statistical machine translation (MT), whereas the other uses the maximum entropy (ME) framework. Starting with an annotated corpus, we describe the problem of NLU as a translation from a source sentence to a formal language target sentence. We mainly focus on the qual...
Parallel Sentences Mining From The Web
Parallel sentences can benefit many NLP applications (e.g., machine translation and cross-language information retrieval). In this paper, candidate bilingual web pages are retrieved by submitting sentence pairs to a search engine and are then validated by surface patterns. We propose an algorithm to extract candidate bilingual resources and filter out useless bilingual web pages. The pair sentences includ...
A Beam-Search Extraction Algorithm for Comparable Data
This paper extends previous work on extracting parallel sentence pairs from comparable data (Munteanu and Marcu, 2005). For a given source sentence S, a maximum entropy (ME) classifier is applied to a large set of candidate target translations. A beam-search algorithm is used to abandon target sentences as non-parallel early on during classification if they fall outside the beam. This way, our...
Discriminant Models for Word Alignment
Word alignment aims to link each word of a translated sentence to its related words in the source sentence. Nowadays, Giza++ is the most used word alignment system. This toolkit implements the generative IBM models. Despite its popularity, several limitations remain. We thus propose to address this task using discriminative models (Maximum Entropy and Conditional Random Fields) which can easily...